Abstract

Background: In Gene Ontology, the "Molecular Function" (MF) categorization is a widely used knowledge
framework for gene function comparison and prediction. Its structure and annotation provide a convenient way to
compare gene functional similarities at the molecular level. The existing gene similarity measures, however, rely
on only one or a few aspects of MF and do not use all of the rich information available, including structure,
annotation, common terms and lowest common ancestors.

Results: We introduce a rank-based gene semantic similarity measure called InteGO, which synergistically integrates
state-of-the-art gene-to-gene similarity measures. By integrating three GO-based seed measures, InteGO improves
performance significantly, by about two-fold, in all three species studied (yeast, Arabidopsis and human).

Conclusions: InteGO is a systematic and novel method to study gene functional associations. The software and
description are available at http://www.msu.edu/~jinchen/InteGO.



Background 

The Gene Ontology (GO) provides a structured, con- 
trolled vocabulary of terms, which are interrelated forming 
a directed acyclic graph (DAG) for describing and categor- 
izing (into three categories) the attributes for genes, gene 
products and sequences [1]. The "molecular function" 
(MF) category describes fundamental biochemical activities 
(including specific binding to ligands or structures of a 
gene product) at the molecular level [2]. As a popular 
resource used for functional annotation, MF provides rich 
information and a convenient way to study gene func- 
tional similarity by comparing terms with which the genes 
are annotated [3-7], which subsequently supports a wide 
variety of applications, such as assessing target gene func- 
tions [8], predicting gene functional associations [9], infer- 
ring protein nomenclature [10], predicting sub-cellular 
localization [11], discovering new pathways [12], etc.

In order to compute gene-to-gene functional similarities using GO, various computational approaches have
been developed. These approaches can be classified into two distinct categories: 1) group-wise measures, which
calculate gene-to-gene similarity directly within a statistical framework that considers all of the terms
annotated to the target genes [13-15], and 2) pair-wise measures, which compute gene-to-gene similarity
indirectly from term-to-term similarities obtained with GO semantic measures [12,16-21]. Each of these
measures makes efficient use of one or a few kinds of knowledge in GO, but none of them relies on all of the
rich information available in the GO databases. In this paper, we propose a new rank-based gene semantic
similarity measure called InteGO (Integrated Gene Ontology measure), which integrates state-of-the-art
gene-to-gene measures [12,13,17], and therefore considers more information than any one of them, to bring the
performance of GO-based functional similarity studies to a higher level.

Figure 1 An illustrative example of Gene Ontology (GO) forming a directed acyclic graph (DAG), in which nodes and
edges represent GO terms and "is-a" or "part-of" relationships between terms. $\{t_1, \ldots, t_7, root\}$ is the
set of GO terms, and $\{g_1, \ldots, g_{10}\}$ is the set of genes annotated to these terms.

In the first GO-based measure category (group-wise), the Yu measure combines elements of topology and annotation
information to calculate a probabilistic level of similarity from GO, and thereby computes gene similarity
directly [13]. The main idea of the Yu measure is that a pair of genes should be very similar if they are
included in a functional group with only a few proteins, whereas the similarity is lower if the gene pair belongs
to a large gene group. Mathematically, given two genes $g_1$ and $g_2$, the gene-to-gene similarity is calculated
with:

$$GeneSim_{Yu}(g_1, g_2) = -\ln \frac{n_{g_1,g_2}}{N} \qquad (1)$$

where $n_{g_1,g_2}$ is the total number of gene pairs that have the same set of lowest common ancestors (LCAs)
as $g_1$ and $g_2$, and $N$ is the total number of gene pairs in the selected GO category. An LCA is the common
ancestor with the highest information content (IC). In the illustrative example in Figure 1, there are in total
45 gene pairs possible among the ten genes; the LCA of gene pair $g_1$ and $g_2$ is $t_1$, and the number of gene
pairs whose LCA is also $t_1$ is 9. Therefore, the similarity of $g_1$ and $g_2$ based on the Yu measure is
$-\ln(9/45) = 1.61$. The Yu measure considers both the elements of topological distance and the LCA distance.
However, it simplifies the computation of the information shared by both genes: it does not use all of the
common parents of the GO terms annotated to $g_1$ or $g_2$, and so neglects the locations of LCAs and the
aggregate semantic contributions from the parents of the target terms (due to the high complexity of graph
matching). Alternatively, the SORA measure [15] computes the IC of a term set by combining inherited and
extended information content of the terms based on the structure of GO. Gene functional similarity is estimated
using the IC overlap ratio of term sets. However, like the Yu measure, it ignores valuable information implicit
in the semantics, i.e., the common parents of the GO terms, when calculating the shared IC and relationships
among involved terms.
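As a quick check of the arithmetic in the Figure 1 example, the Yu score can be sketched in a few lines of Python. The function name gene_sim_yu and its argument names are illustrative assumptions, not part of InteGO; only the counts (ten genes, 45 pairs, 9 pairs sharing the LCA) come from the text.

```python
import math


def gene_sim_yu(n_same_lca_pairs: int, n_total_pairs: int) -> float:
    """Yu-style score: -ln(n / N), following Eq 1 (names are hypothetical)."""
    if not 0 < n_same_lca_pairs <= n_total_pairs:
        raise ValueError("need 0 < n_same_lca_pairs <= n_total_pairs")
    return -math.log(n_same_lca_pairs / n_total_pairs)


# Ten genes give C(10, 2) = 45 possible gene pairs, as in the Figure 1 example.
total_pairs = math.comb(10, 2)
yu_score = gene_sim_yu(9, total_pairs)
print(total_pairs, round(yu_score, 2))  # prints: 45 1.61
```

Smaller groups give larger scores: a pair that shares its LCA set with only one pair out of 45 would score -ln(1/45), about 3.81.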

In the measures in the second category (pair-wise), pair-wise term comparisons originally developed for natural
language processing [16,18-21] are utilized; these are strongly dependent on the specific taxonomy. Among the
earlier methods, an IC-based measure called the Resnik measure has shown strong correlations between its results
and gene expression similarities in yeast [16,22]. Mathematically, given a GO term $t$, its IC is defined as a
negative log likelihood, $IC(t) = -\log(|G_t| / |G_{root}|)$, where $G_t$ and $G_{root}$ are the sets of genes
annotated to term $t$ and the root term (including all of its descendants) respectively. In the Resnik measure,
the similarity between terms $t_1$ and $t_2$ is defined as the IC of their LCA:
$TermSim_{Resnik}(t_1, t_2) = IC(LCA_{12})$. Although the Resnik measure correlates strongly with gene expression
data [22], all term pairs sharing the same LCA have the same semantic similarity, even if they are at very
different levels of GO. Consequently, it cannot differentiate term pairs that are far from their LCA from term
pairs that are close to the same LCA. In the illustrative example in Figure 1, the LCA of $t_2$ and $t_7$ is
$t_1$, which is also the LCA of $t_3$ and $t_8$. According to the Resnik measure,
$TermSim_{Resnik}(t_2, t_7) = TermSim_{Resnik}(t_3, t_8) = 0.51$, but clearly the distance from $t_2$ and $t_7$ to
the LCA is shorter. To take into account both the distance from the LCA to the target terms and the distance
from the LCA to the root [17], a later measure called the Schlicker measure was proposed:

$$TermSim_{Schlicker}(t_1, t_2) = \frac{2 \times \log(|G_{LCA_{12}}| / |G_{root}|)}{\log(|G_{t_1}| / |G_{root}|) + \log(|G_{t_2}| / |G_{root}|)} \times \left(1 - \frac{|G_{LCA_{12}}|}{|G_{root}|}\right) \qquad (2)$$

where $G_{LCA_{12}}$ is the set of genes annotated to the LCA of $t_1$ and $t_2$.

In Eq 2, the first part on the right side of the equation quantifies the distance from terms $t_1$ and $t_2$ to
their LCA, and the second part measures the distance from the LCA to the root, where a short former distance and
a long latter distance indicate a higher similarity. Experimental results revealed that the Schlicker measure
agrees with sequence similarity [17]. In the same example in Figure 1, the Schlicker measure is able to
differentiate term pairs $(t_2, t_7)$ and $(t_3, t_8)$, with $TermSim_{Schlicker}(t_2, t_7) = 0.15$ and
$TermSim_{Schlicker}(t_3, t_8) = 0.09$. However, a problem common to the Schlicker measure and the Resnik measure
is that they consider only a single common ancestor, neglecting the fact that two GO terms may have multiple
common ancestors in the GO structure [23].
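The two IC-based term measures can be compared on a toy case. This is a minimal sketch, assuming natural logarithms (the text writes only "log") and the commonly cited form of the Schlicker measure, in which a Lin-style ratio is multiplied by one minus the annotation frequency of the LCA. The gene counts below are hypothetical and are not read from Figure 1.

```python
import math


def information_content(n_term_genes: int, n_root_genes: int) -> float:
    """IC(t) = -log(|G_t| / |G_root|); the natural log is an assumption."""
    return -math.log(n_term_genes / n_root_genes)


def term_sim_resnik(n_lca_genes: int, n_root_genes: int) -> float:
    """Resnik similarity: the IC of the lowest common ancestor."""
    return information_content(n_lca_genes, n_root_genes)


def term_sim_schlicker(n_t1: int, n_t2: int, n_lca: int, n_root: int) -> float:
    """Lin-style ratio (terms vs. LCA) times 1 - p(LCA) (LCA vs. root)."""
    ic_sum = information_content(n_t1, n_root) + information_content(n_t2, n_root)
    if ic_sum == 0:
        return 0.0  # both terms are as general as the root
    closeness_to_lca = 2 * information_content(n_lca, n_root) / ic_sum
    return closeness_to_lca * (1 - n_lca / n_root)


# Hypothetical counts: 10 annotated genes in total, 6 under the shared LCA.
N_ROOT, N_LCA = 10, 6
near_pair = term_sim_schlicker(4, 3, N_LCA, N_ROOT)  # terms close to the LCA
far_pair = term_sim_schlicker(2, 1, N_LCA, N_ROOT)   # terms far below the LCA
print(round(term_sim_resnik(N_LCA, N_ROOT), 2))      # identical for both pairs
print(near_pair > far_pair)                          # prints: True
```

With these counts the Resnik score is the same for both pairs because they share one LCA, while the Schlicker score is higher for the pair whose terms sit closer to that LCA.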

Recently, the Wang measure was proposed to consider all of the parent terms of the target terms [12]. Given a
term $t_1$ and its parent term $p$, the semantic contribution of $p$ to $t_1$, denoted $S_{t_1}(p)$, is defined
as the maximal semantic contribution of the paths from $t_1$ to $p$. The GO term similarity in the Wang measure
is defined in Eq 3, where $P_1$ (or $P_2$) is the set of all of the parents of $t_1$ (or $t_2$).

$$TermSim_{Wang}(t_1, t_2) = \frac{\sum_{p \in P_1 \cap P_2} \left(S_{t_1}(p) + S_{t_2}(p)\right)}{\sum_{t \in P_1} S_{t_1}(t) + \sum_{t \in P_2} S_{t_2}(t)} \qquad (3)$$

The experimental results show that this measure performs significantly better than the Resnik measure on yeast
protein functional similarities [12]. However, the Wang measure ignores both the topological distances among
the LCAs and the statistics of gene annotations that the Yu measure takes into consideration. For the same
example in Figure 1, to compare the similarity of terms $t_3$ and $t_8$, all of the parents of the two terms,
$P_3 = \{t_1, t_2, t_3, t_4, t_5, root\}$ and $P_8 = \{t_1, t_5, t_6, t_7, t_8, root\}$, are considered by the
Wang measure.
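The idea of aggregating contributions from every ancestor can be sketched as follows. This is a sketch under stated assumptions: the edge weights 0.8 ("is-a") and 0.6 ("part-of") are the values usually quoted for the Wang measure and are not given in this paper, the toy DAG is hypothetical (it is not the graph of Figure 1), and each term is counted among its own ancestors with contribution 1.

```python
from typing import Dict, List, Tuple

# Edge weights usually quoted for the Wang measure (an assumption here).
WEIGHTS = {"is_a": 0.8, "part_of": 0.6}

# Hypothetical toy DAG: term -> list of (parent, relation).
PARENTS: Dict[str, List[Tuple[str, str]]] = {
    "root": [],
    "a": [("root", "is_a")],
    "b": [("a", "is_a")],
    "c": [("a", "part_of")],
    "d": [("b", "is_a"), ("c", "is_a")],
    "e": [("c", "is_a")],
}


def semantic_contributions(term: str) -> Dict[str, float]:
    """Map each ancestor p of `term` (and `term` itself) to S_term(p)."""
    contrib = {term: 1.0}
    stack = [term]
    while stack:
        child = stack.pop()
        for parent, relation in PARENTS[child]:
            candidate = contrib[child] * WEIGHTS[relation]
            # Keep the maximal path contribution; revisit a parent if it improves.
            if candidate > contrib.get(parent, 0.0):
                contrib[parent] = candidate
                stack.append(parent)
    return contrib


def term_sim_wang(t1: str, t2: str) -> float:
    """Shared ancestor contributions over all ancestor contributions (Eq 3)."""
    s1, s2 = semantic_contributions(t1), semantic_contributions(t2)
    shared = set(s1) & set(s2)
    numerator = sum(s1[p] + s2[p] for p in shared)
    return numerator / (sum(s1.values()) + sum(s2.values()))


print(round(term_sim_wang("d", "e"), 3))  # prints: 0.564
```

In this toy graph, term "d" reaches "a" by two paths (0.8 x 0.8 = 0.64 via "b" and 0.8 x 0.6 = 0.48 via "c"), and the larger value is kept, which is the "maximal semantic contribution" rule described above.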

For the Resnik, Schlicker and Wang measures, gene-to-gene similarity is computed from the similarities of the GO
terms annotated to the target genes. Following Wang et al. [12], let $g_1$ and $g_2$ be two genes and $T_1$ and
$T_2$ be the sets of GO terms annotated to $g_1$ and $g_2$; the gene-to-gene similarity is calculated by Eq 4:

$$GeneSim(g_1, g_2) = \frac{\sum_{t \in T_1} TermSim(t, T_2) + \sum_{t \in T_2} TermSim(t, T_1)}{|T_1| + |T_2|} \qquad (4)$$

where $t$ is a GO term and $TermSim(t, T_x) = \max_{t_i \in T_x} Sim(t, t_i)$ represents the highest similarity
between $t$ and the terms in set $T_x$. Note that, for both $|T_1|$ and $|T_2|$, only the terms with
$TermSim(t, T_x) \neq 0$ are counted.
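Eq 4 amounts to a best-match average over the two annotation sets. The following is a minimal sketch, assuming a caller-supplied term similarity function; the term names and similarity values are made up for illustration.

```python
from typing import Callable, Iterable, List

TermSim = Callable[[str, str], float]


def term_to_set_sim(term: str, term_set: Iterable[str], term_sim: TermSim) -> float:
    """Highest similarity between `term` and any term in `term_set`."""
    return max((term_sim(term, other) for other in term_set), default=0.0)


def gene_sim(terms1: List[str], terms2: List[str], term_sim: TermSim) -> float:
    """Best-match average of Eq 4; zero-scoring terms are not counted."""
    scores = [term_to_set_sim(t, terms2, term_sim) for t in terms1]
    scores += [term_to_set_sim(t, terms1, term_sim) for t in terms2]
    nonzero = [s for s in scores if s != 0]
    return sum(nonzero) / len(nonzero) if nonzero else 0.0


# Hypothetical term-to-term similarities (symmetric, 1.0 for identical terms).
TOY_SIM = {frozenset(("a", "c")): 0.6, frozenset(("a", "d")): 0.2}


def toy_term_sim(x: str, y: str) -> float:
    return 1.0 if x == y else TOY_SIM.get(frozenset((x, y)), 0.0)


# Gene 1 is annotated with terms a and b, gene 2 with terms c and d.
toy_gene_sim = gene_sim(["a", "b"], ["c", "d"], toy_term_sim)
print(round(toy_gene_sim, 3))
```

Here term "b" matches nothing in the other set, so it is left out of both the sum and the count; the result is (0.6 + 0.6 + 0.2) / 3.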

To the best of our knowledge, each existing measure emphasizes only one or a few types of relationships
between genes and ignores the others. One of these measures may be better than the others on one specific
set of terms and genes, but may perform worse on another gene set. Since none of the existing measures takes
into account all of the aspects of GO (structure, annotation, LCA, all of the common parents, etc.), which is
of course a challenging task, we hypothesize that the integration of multiple measures can improve performance,
as the integration of multiple methods has been widely applied for performance boosting [24-26]. In this paper,
we propose a rank-based gene semantic similarity measure called InteGO, which synergistically integrates
state-of-the-art gene-to-gene similarity measures. The integrated measures are called seed measures in the rest
of the paper. The major contributions of our work are:

• While the existing measures consider only one or a few aspects of the problem, InteGO is an integrative
approach, which conceptually considers all of the information in GO to reduce incorrect score assignments.
In addition, InteGO employs an adaptive approach to optimize the seed measure integration.

• A rank-based approach is used to integrate multi- 
ple seed measures. Since the values from different 
seed measures have different scales and distributions, 
a direct integration of the values may lead to biased 
results. With our rank-based approach, InteGO uni- 
fies the scale and distribution among different seed 
measures, ensuring fair comparison. 

• InteGO is an open framework, which offers the flexibility to integrate more GO similarity measures and
more advanced evaluation and integration methods in the future.

InteGO was systematically tested on three species with 
different levels of complexity of GO annotations, i.e., 
yeast, Arabidopsis and human. The experimental results 
on all of the three species show that InteGO performs 
consistently better than the other measures in all of the 
tests. 

Method 

In order to integrate multiple seed measures in InteGO, 
two key problems need to be solved: first, how to select 
the most appropriate seed measures for integration; 
second, how to integrate all of the scores from the dif- 
ferent seed measures. To solve these problems, InteGO 
is divided into two steps: 1) to compute similarity scores 
with every seed measure individually and rank the 
scores, and 2) to evaluate and integrate the ranks of 
multiple seed measures. 

Rank-based similarity 

The outputs of the different gene-to-gene similarity measures have different scales and distributions. Therefore,
a direct integration of the values may lead to biased results. In InteGO, we unify the scale and distribution
among different seed measures with a rank-based approach. One common problem of rank-based approaches, though,
is their dependence on data size: while a rank-based approach can work well on a relatively large dataset, it is
often inadequate on a small one. For example, if only two genes are provided by a user, the similarity rank of
the two genes is always one, regardless of how high or how low the actual similarity score is. Therefore, instead
of requiring users to always provide a large set of genes to compare (which is not always reasonable), InteGO
maintains a background set of genes (BG) for every species of interest to unify the similarity scores from the
multiple seed measures. BG must satisfy two requirements: 1) it is large enough; 2) it includes, without bias,
the full spectrum of gene similarity scores, ranging from the lowest to the highest.

The framework of InteGO is shown in Figure 2. In the steps with grey background, the similarity scores in BG
are pre-calculated with all of the seed measures and saved in a database called GeneSimDB. When a user
inputs a gene set G, the similarity scores of all of the gene pairs in G and all of the gene pairs between G and
BG are calculated with all of the seed measures and merged into GeneSimDB. If G is a subset of BG,
InteGO outputs the results directly. Finally, all of the gene pairs in GeneSimDB are sorted in ascending order
of their gene similarity scores and are ranked. The ranked gene similarity score $RankSim(g_1, g_2, m)$ for
genes $g_1$ and $g_2$ in G is calculated as:



$$RankSim(g_1, g_2, m) = \frac{2 \times r^{m}_{g_1,g_2}}{(|BG \cup G|)^{2}} \qquad (5)$$

where $r^{m}_{g_1,g_2}$ is the rank of gene pair $g_1$ and $g_2$ using seed measure $m$, BG is the predefined
background gene set, and G is the user-provided gene set. The ranked similarity indicates how similar a given
gene pair is against the background of all of the gene pairs.



One advantage of the rank-based measure is that it unifies the different scales and distributions among the seed
measures, so that agreement among the ranks can indicate functional similarity appropriately.
An illustrative example is shown in Table 1. Given ten gene pairs, three measures ($M_A$, $M_B$ and $M_C$) are
used to calculate the gene-to-gene semantic similarities based on GO. The first group of columns shows
that the similarity scores of measures $M_A$, $M_B$ and $M_C$ have different scales and different distributions.
For example, the semantic similarity of gene pair 3 is 3.0 for measure $M_A$ and 0.9 for measure $M_B$, although
both values indicate the highest functional similarity in their own datasets. The second group of columns shows
the ranks of the gene pairs under each seed measure in ascending order.
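The ranks in Table 1 can be reproduced with a few lines of Python. The sketch below assumes ascending competition ranking, in which tied scores share the lowest rank (this is what the two tied gene pairs under M_A in Table 1 suggest), and it normalizes by the number of gene pairs, which is how the integrated values in Table 1 relate to the ranks; Eq 5 itself normalizes by the size of the union of BG and G. Function names are illustrative.

```python
from typing import List


def competition_ranks(scores: List[float]) -> List[int]:
    """Ascending ranks starting at 1; tied scores share the lowest rank."""
    first_rank = {}
    for position, value in enumerate(sorted(scores), start=1):
        first_rank.setdefault(value, position)
    return [first_rank[value] for value in scores]


def rank_sim(scores: List[float]) -> List[float]:
    """Rank divided by the number of gene pairs (assumed normalization)."""
    return [rank / len(scores) for rank in competition_ranks(scores)]


# Similarity scores of the ten gene pairs of Table 1 under each seed measure.
M_A = [2.4, 1.8, 3.0, 1.2, 0.9, 0.5, 1.0, 1.8, 0.2, 2.1]
M_B = [0.2, 0.6, 0.9, 0.3, 0.1, 0.5, 0.4, 0.4, 0.7, 0.8]
M_C = [0.04, 0.12, 0.03, 0.05, 0.06, 0.02, 0.09, 0.13, 0.01, 0.16]

for name, scores in (("M_A", M_A), ("M_B", M_B), ("M_C", M_C)):
    print(name, competition_ranks(scores))
```

Although the three score lists live on very different scales, their ranks all run from 1 to 10, which is what makes the later integration step meaningful.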

Adaptive integration approach 

The rank-based semantic similarities of gene pairs from every seed measure provide a unique opportunity to
compute gene-to-gene similarities with all of the GO information utilized by the seed measures. A key
problem here is how to select the most appropriate integration approach. Although there are many integration
approaches that work well in certain domains, no single method is always better than the others. In fact, the
choice of an appropriate integration method depends largely on the content of the study.
Therefore, we propose an adaptive approach to automatically select the most appropriate integration method
from a set of candidates. The main idea of the adaptive approach is to score all of the methods in the pool of the



Figure 2 Framework of InteGO for calculating the rank-based gene-to-gene similarities in MF. The boxes in the grey
block are the pre-processing modules for the preparation of the background gene set. Flowchart steps, in order:
construct the background sets of genes BG1, BG2, BG3, ...; pre-calculate gene-to-gene similarities for gene pairs
in BGi and save them in GeneSimDB; identify the best integration method for each background set BGi; obtain a set
of genes from user input, saved in gene set G, and the selection of background set BGi; calculate gene-to-gene
similarities for gene pairs in G and for gene pairs between G and BG; append all the results to GeneSimDB and
rebuild the index; rank all the gene pairs in GeneSimDB based on the similarity scores; compute the rank-based
gene-to-gene similarity for all the genes in G; output results.
Table 1 Illustrative example for integration similarity.

Gene pairs      Semantic similarity     Rank of similarity      Integration of ranks
                M_A     M_B     M_C     M_A     M_B     M_C     MAX     MIN     MEAN    MEDIAN
Gene pair 1     2.4     0.2     0.04    9       2       4       0.9     0.2     0.5     0.4
Gene pair 2     1.8     0.6     0.12    6       7       8       0.8     0.6     0.7     0.7
Gene pair 3     3.0     0.9     0.03    10      10      3       1.0     0.3     0.8     1.0
Gene pair 4     1.2     0.3     0.05    5       3       5       0.5     0.3     0.4     0.5
Gene pair 5     0.9     0.1     0.06    3       1       6       0.6     0.1     0.3     0.3
Gene pair 6     0.5     0.5     0.02    2       6       2       0.6     0.2     0.3     0.2
Gene pair 7     1.0     0.4     0.09    4       4       7       0.7     0.4     0.5     0.4
Gene pair 8     1.8     0.4     0.13    6       4       9       0.9     0.4     0.6     0.6
Gene pair 9     0.2     0.7     0.01    1       8       1       0.8     0.1     0.3     0.1
Gene pair 10    2.1     0.8     0.16    8       9       10      1.0     0.8     0.9     0.9

M_A, M_B and M_C are three seed gene-to-gene functional similarity measures.



candidate integration approaches with the background set 
BG, and then select the best one. 

InteGO provides four integration methods: max, min, mean and median. As an open system, InteGO also
allows users to use their own integration methods. Mathematically, let $RankSim(g_1, g_2, m)$ be the rank-based
similarity of genes $g_1$ and $g_2$ using seed measure $m$; InteGO is defined as:

$$InteGO(g_1, g_2) = \begin{cases}
\max_{m \in M} RankSim(g_1, g_2, m) & \text{if } I = \text{max} \\
\min_{m \in M} RankSim(g_1, g_2, m) & \text{if } I = \text{min} \\
\mathrm{mean}_{m \in M} RankSim(g_1, g_2, m) & \text{if } I = \text{mean} \\
\mathrm{median}_{m \in M} RankSim(g_1, g_2, m) & \text{if } I = \text{median} \\
\mathrm{integration}_{m \in M} RankSim(g_1, g_2, m) & \text{if } I = \text{other integration}
\end{cases} \qquad (6)$$

where $M$ is the set of seed measures and $I$ is an integration method, which is the max, min, mean or median of
all of the ranks, or any other integration method defined by the user. For the illustrative example in Table 1,
the results of the four different integration methods are shown in the third group of columns.
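The integration step of Eq 6 can be sketched as a lookup of aggregation functions. This is an illustrative sketch, not InteGO's Java implementation; the function and dictionary names are assumptions, and the input values are the rank-based similarities (rank divided by ten) of three gene pairs from Table 1.

```python
from statistics import mean, median
from typing import Callable, Sequence, Union

Integrator = Callable[[Sequence[float]], float]

# The four built-in integration methods named in the text.
INTEGRATORS = {"max": max, "min": min, "mean": mean, "median": median}


def integrate(rank_sims: Sequence[float],
              method: Union[str, Integrator] = "max") -> float:
    """Combine the rank-based similarities of one gene pair across seed measures."""
    integrator = INTEGRATORS[method] if isinstance(method, str) else method
    return integrator(rank_sims)


# Rank-based similarities under M_A, M_B and M_C, taken from Table 1.
TABLE1_RANK_SIMS = {
    "Gene pair 1": [0.9, 0.2, 0.4],
    "Gene pair 3": [1.0, 1.0, 0.3],
    "Gene pair 9": [0.1, 0.8, 0.1],
}

for pair, values in TABLE1_RANK_SIMS.items():
    row = [round(integrate(values, name), 1) for name in ("max", "min", "mean", "median")]
    print(pair, row)
```

Passing a callable instead of a name covers the "other integration" case, for example integrate(values, lambda xs: sum(xs) / len(xs)) for a hand-written mean.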

To automatically determine which integration method is the best, all of the gene pair similarities in BG are
calculated with each candidate integration method and are evaluated systematically against biological data.
Recent studies used correlation coefficients with gene expression correlations or gene sequence similarities to
evaluate MF-based gene similarities [22]. However, gene functional similarity is not always correlated with gene
expression correlation or sequence similarity [12]. Furthermore, previous studies show that enzymes are
usually categorized biochemically by EC (Enzyme Commission) numbers rather than by their nucleotide or amino
acid sequences [27,28]. This suggests that EC numbers are a better basis for assessing molecular function,
under the criterion that the molecular functions of a group of genes are similar if they have the same EC
numbers [12,29,30].

To systematically use EC numbers to choose an integration method, all of the genes in BG are grouped based on
their EC numbers (four digits), and then the differences between the inter- and intra-EC gene-to-gene
similarities are tested. For an integration method, the higher the ratio between intra-EC gene similarities and
inter-EC gene similarities, the better the method is. Quantitatively, we use the logged fold change (LogFC)
measure, which has been widely used in gene expression studies [31]. The LogFC score of EC $e_i$ is defined in
Eq 7:



$$LogFC(e_i) = \frac{1}{|E|} \times \sum_{e_j \in E;\, G(e_j) \cap G(e_i) = \emptyset} \frac{\sum_{g \in G(e_i)} diff_g(e_i, e_j)}{|G(e_i)|} \qquad (7)$$

where $G(e_i)$ is the set of all of the genes whose EC number is $e_i$; $E$ is the set of ECs that share no genes
with $e_i$ ($G(e_j) \cap G(e_i) = \emptyset$); and $diff_g(e_i, e_j)$ is computed as:

$$diff_g(e_i, e_j) = \ln \frac{|G'(e_i)| \times \sum_{g' \in G(e_j)} \left(1 - GeneSim(g, g') + c\right)}{|G(e_j)| \times \sum_{g^{*} \in G'(e_i)} \left(1 - GeneSim(g, g^{*}) + c\right)} \qquad (8)$$

where $c$ is a Laplacian smoothing parameter, a small positive constant; $G'(e_i)$ is the set of all of the genes
assigned to EC $e_i$ except gene $g$; $G(e_j)$ is the set of all of the genes assigned to EC $e_j$; and $g$ is a
gene assigned to $e_i$. In Eq 8, the numerator represents the inter-EC distance and the denominator represents
the intra-EC distance. The higher $diff_g(e_i, e_j)$ is, the more pronounced the positive difference between the
inter-EC distance and the intra-EC distance is.

For example, suppose there are nine genes in BG, four of which have the same EC number, labeled $e_1$, while the
other five genes belong to another EC number, labeled $e_2$. To calculate the LogFC score for $e_1$, we first
compute $diff_g(e_1, e_2)$ with Eq 8 for each gene $g$ in $e_1$: every gene in $e_1$ is compared with every other
gene in $e_1$ for the average intra-EC difference, and then with every gene in $e_2$ to get the inter-EC
differences. $LogFC(e_1)$ is the average of all of the $diff_g(e_1, e_2)$ scores for the genes assigned to $e_1$.

The method that has the highest LogFC scores over all of the ECs is considered the most appropriate integration
method for BG. If a user input set G is much smaller than BG (which often happens), we assume that the selected
method is also the most suitable for $G \cup BG$. If the size of G is comparable to that of BG, it is not
necessary to use BG, and the integration method is selected based on the evaluation on G.
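The EC-based scoring can be sketched as below, under stated assumptions: diff_g is taken as the log of the mean inter-EC distance over the mean intra-EC distance, with distance defined as 1 - GeneSim + c; LogFC averages diff_g over the genes of an EC and over all ECs that share no genes with it; and c = 0.01 is an arbitrary choice. The toy similarity functions and gene names are hypothetical.

```python
import math
from typing import Callable, Dict, List

GeneSim = Callable[[str, str], float]


def diff_g(gene: str, own: List[str], other: List[str],
           gene_sim: GeneSim, c: float = 0.01) -> float:
    """ln(mean inter-EC distance / mean intra-EC distance) for one gene."""
    peers = [g for g in own if g != gene]  # same EC, excluding the gene itself
    inter = sum(1 - gene_sim(gene, g) + c for g in other) / len(other)
    intra = sum(1 - gene_sim(gene, g) + c for g in peers) / len(peers)
    return math.log(inter / intra)


def log_fc(ec: str, ec_genes: Dict[str, List[str]],
           gene_sim: GeneSim, c: float = 0.01) -> float:
    """Average diff_g over the genes of `ec` and over non-overlapping ECs."""
    own = ec_genes[ec]
    others = [genes for name, genes in ec_genes.items()
              if name != ec and not set(genes) & set(own)]
    if len(own) < 2 or not others:
        raise ValueError("need >= 2 genes in the EC and one disjoint EC")
    per_ec = [sum(diff_g(g, own, other, gene_sim, c) for g in own) / len(own)
              for other in others]
    return sum(per_ec) / len(per_ec)


# Toy data mirroring the nine-gene example: four genes in e1, five in e2.
EC_GENES = {"e1": ["a1", "a2", "a3", "a4"],
            "e2": ["b1", "b2", "b3", "b4", "b5"]}


def informative_sim(x: str, y: str) -> float:
    """Hypothetical measure: high similarity only within the same EC."""
    return 0.9 if x[0] == y[0] else 0.1


def flat_sim(x: str, y: str) -> float:
    """Hypothetical uninformative measure: same score for every pair."""
    return 0.5


print(round(log_fc("e1", EC_GENES, informative_sim), 2))  # prints: 2.11
print(abs(log_fc("e1", EC_GENES, flat_sim)) < 1e-9)       # prints: True
```

A measure that separates the two ECs gets a clearly positive score, while a measure that scores every pair alike gets a score of about zero, so picking the integration method with the highest score favours the one that best separates EC groups.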

Results 

To systematically evaluate the performance of InteGO, we tested it on three model organisms with different
levels of GO annotation scale and complexity. For each of them, we adopted EC numbers and protein sequences
as independent biological evidence.

Data preparation 

The GO annotation and structure data were downloaded from the GO website
(http://www.geneontology.org/GO.downloads.shtml). To systematically evaluate different GO-based gene-to-gene
similarity measures on MF, the pathway and EC number information of yeast, Arabidopsis and human was downloaded
from the Saccharomyces genome database (http://biocyc.org/YEAST/organism-summary?object=YEAST), PlantCyc
(http://ftp.plantcyc.org/Pathways) and HumanCyc (http://humancyc.org) respectively. Note that our EC-based
evaluation method requires that an EC has at least two genes. In yeast, Arabidopsis and human, 95, 325 and 312
ECs satisfy this criterion. The protein sequences were downloaded from the Saccharomyces genome database
(http://www.yeastgenome.org/download-data/sequence), TAIR
(http://www.arabidopsis.org/tools/bulk/sequences/index.jsp) and UniProt (http://www.uniprot.org) respectively.

Let E be the set of all of the ECs that have at least one gene assignment; we define BG as the set of all of the
genes that have at least one EC assignment in E. This definition of BG ensures that for any gene in BG the
intra-EC similarity is valid. The sizes of BG are 218, 1,348 and 1,504 for yeast, Arabidopsis and human
respectively. An experiment on the variation of the background set (see Additional file 1) reveals that the use
of a relatively small background set may affect performance significantly. Additional files 2, 3 and 4 show the
distributions of the gene-to-gene similarities with the Yu, Schlicker and Wang measures, in which the similarity
scores are spread over the full range. In summary, the background gene sets are well prepared.

InteGO was implemented with Java JDK 1.6 and the JUNG library [32]. The experiments were run on a
Windows 7 computer with an Intel i7 CPU and 10 GB RAM.

Selecting seed measures 

In order to select the most appropriate seed measures for InteGO, we screened four existing measures (Yu, Resnik,
Schlicker and Wang) using the EC-based evaluation method. Figure 3 shows that, among the Yu, Schlicker and
Wang measures, no single measure is significantly better than the others. The Yu, Schlicker and Wang measures
all perform the best on yeast, sharing the highest median value. The Schlicker measure performs
best on Arabidopsis, while the Yu measure is best on human. Therefore, we chose all three as the seed
measures in InteGO. We did not choose the Resnik measure, because it is clearly not as good as the other
measures in any of the three species. Note that the upper bound and the lower bound of the LogFC scores in
Figure 3 were set to 5 and -0.05 respectively to eliminate outliers.

In addition, Figure 4 shows that although all three seed measures perform equally well on some ECs,
each measure has its own favorable EC groups. For example, the Schlicker and Wang measures perform the
best in 51 and 52 out of the total 325 Arabidopsis ECs respectively (see Figure 4(b)), more than the
Yu measure (20 ECs). However, the Yu measure performs the best in 159 out of the total 315 human ECs,
dominating the EC group distribution in human (see Figure 4(c)). Therefore, an appropriate integration
of these measures may combine the advantages of the different measures and improve the overall performance.
Note that although only four measures were screened in the experiment, more measures can be evaluated and
added later since InteGO has an open framework.

Selecting integration method 

In order to select the most appropriate integration method, four different approaches (MAX, MIN, MEAN
and MEDIAN) were tested and compared. Figure 5 shows that MAX performs the best among the four
integration methods. In yeast, although almost all of the measures have the same median value, the 25th
percentile of MAX is 5, significantly higher than that of the Yu, Schlicker and Wang measures (1.68, 3.00 and
2.04 respectively) and of the other integration methods. In Arabidopsis and human, the median of MAX is 5 in
both cases, which is also significantly higher than that of all of the other integration methods. This indicates
that MAX, a simple integration approach, improves performance by around two-fold. This is because the
integration considers all of the aspects of GO, while an individual seed measure, although nicely designed, is
compromised in that it focuses on only one or a few kinds of knowledge in GO. The other integration methods,
especially MIN, however, do not noticeably improve the gene similarity performance. As shown in
Figure 5(c), the result of MIN is even worse than that of the seed measures. This indicates that the performance
of gene-to-gene similarity can be improved significantly only by an appropriate integration.

Figure 3 Logged fold change (LogFC) score comparison for four different similarity measures in the Molecular
Function (MF) category on yeast (a), Arabidopsis (b) and human (c).



As mentioned in the previous section, the seed measures have their own favorable EC groups. To test
whether MAX takes advantage of all of the strengths of the seed measures, we compared MAX with the Yu,
Schlicker and Wang measures on all of the ECs. Figure 6(a), (b) and (c) show that MAX dominates the EC
groups, in clear contrast to the results in Figure 4. In detail, MAX performs the best in 140 and 172 out of
325 and 315 ECs in Arabidopsis and human respectively, while the numbers are only 2, 9, 6 in Arabidopsis and 2,
2, 1 in human for the Yu, Wang and Schlicker measures respectively. In summary, the experiment indicates that
integrating multiple measures can improve the performance of gene similarity measurement and that MAX is the
best integration method.

Statistical analysis was carried out to test whether the results of the best integration method (MAX) of
InteGO are statistically the best. We compared InteGO with the three seed measures using the TukeyHSD test [33].
The p-values shown in Table 2 and the 95% family-wise confidence levels (Additional files 5, 6 and 7) indicate
that the results of MAX are significantly better than the results of all of the seed measures in yeast,
Arabidopsis and human, with the only exception that the Schlicker measure's results are comparable in yeast:
the Schlicker measure performs very well in yeast, so there is little room for InteGO to improve.

Protein sequence based performance evaluation 

In addition to using EC numbers as the evaluation criterion, protein sequence similarities were employed as
independent evidence for further performance study. Although the correlation between semantic similarity and
sequence similarity is not as strong as that with EC, it is generally accepted that as sequence similarity
increases, so does the chance that the proteins are homologues, in which case they are likely to have
identically annotated molecular functions [34]. In our test, sequence similarity scores (ln(BitScore)) of all
genes in the BG of the three species were calculated with BLAST, resulting in 20,652 yeast, 772,609 Arabidopsis
and 942,609 human gene pairs. As shown in Figure 7, the semantic similarity measurements show a correlation with
sequence similarity. The covariance scores (see Additional file 8) on all three species reveal that InteGO is
overall the best measure.

Figure 4 Venn diagram for the Yu, Schlicker and Wang measures, with the number of ECs on which each performs best
on yeast (a), Arabidopsis (b) and human (c). Blue, green and yellow represent the Yu measure, the Schlicker
measure and the Wang measure respectively.

Figure 5 Logged fold change (LogFC) score comparison for four integration measures (MAX, MIN, MEAN and MEDIAN)
and three integrated measures (Yu measure, Schlicker measure and Wang measure) in the Molecular Function (MF)
category on yeast (a), Arabidopsis (b) and human (c).

Conclusions 

Comparing genes at the functional level is vital for a
variety of applications [3-7]. The existing GO semantic
based measures either calculate gene-to-gene similarities
directly [13], or indirectly compute gene-to-gene
similarities from term-to-term similarities [12,17].
Unfortunately, none of them takes into account all aspects
of the rich information in GO (structure, annotation, LCA,
all parent terms, etc.). In this paper, we proposed a new
measure called InteGO to appropriately integrate the seed
measures, with the following advantages: 1) InteGO employs
an adaptive approach which enables the optimization of
seed measure integration; 2) it applies a rank-based
integration approach, which unifies the scale and
distribution differences among different seed measures;
3) InteGO is an open-platform measure that allows users to
add/delete seed measures, redefine the background gene set
and change the rank-based integration method.
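
The idea of rank-based integration can be sketched as follows. This is an illustration under stated assumptions, not the InteGO implementation: the function names `normalized_ranks` and `integrate`, the toy scores, the average-rank handling of ties and the division by the number of gene pairs are all choices made for this example. Each seed measure's scores are first converted to ranks within the background set, which removes differences in scale and distribution, and the ranks are then combined per gene pair with an interchangeable function such as MAX, MIN, MEAN or MEDIAN.

```python
from statistics import median

def normalized_ranks(scores):
    """Map each gene pair to a rank in (0, 1]; tied scores share the average rank."""
    ordered = sorted(scores, key=scores.get)  # gene pairs sorted by ascending score
    n = len(ordered)
    ranks = {}
    i = 0
    while i < n:
        j = i
        # Extend j over the run of pairs whose score ties with ordered[i].
        while j + 1 < n and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[ordered[k]] = avg_rank / n
        i = j + 1
    return ranks

def integrate(seed_scores, combine=max):
    """Combine the per-measure normalized ranks of every gene pair."""
    per_measure = [normalized_ranks(s) for s in seed_scores.values()]
    pairs = per_measure[0].keys()
    return {p: combine([r[p] for r in per_measure]) for p in pairs}

# Toy scores on deliberately different scales (made-up numbers).
seed_scores = {
    "yu": {("g1", "g2"): 0.90, ("g1", "g3"): 0.20, ("g2", "g3"): 0.50},
    "schlicker": {("g1", "g2"): 0.30, ("g1", "g3"): 0.10, ("g2", "g3"): 0.20},
    "wang": {("g1", "g2"): 5.0, ("g1", "g3"): 1.0, ("g2", "g3"): 9.0},
}

integrated_max = integrate(seed_scores, combine=max)
integrated_median = integrate(seed_scores, combine=median)
```

Passing a different `combine` function (for example `min` or `statistics.mean`) changes the integration method without touching the ranking step, and adding or removing an entry in `seed_scores` adds or removes a seed measure.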

To demonstrate the advantages of InteGO, we compared its
performance on EC-assigned gene similarities and sequence
similarities with that of three existing measures (the Yu,
Schlicker and Wang measures) in yeast, Arabidopsis and
human. Compared with these state-of-the-art measures, the
experimental results show that InteGO increases the LogFC
scores by about two-fold. This indicates that integrating
multiple measures appropriately can improve the
performance of the functional similarity measure. In
particular, we found that taking the maximal ranks from
all of the seed measures performs the best. The
covariances between semantic
Figure 6 Venn diagram for the best integration measure MAX and the Yu, Schlicker and Wang measures, showing the number of ECs on which
each measure performs best on yeast (a), Arabidopsis (b) and human (c). Purple, blue, green and yellow represent the MAX, Yu, Schlicker
and Wang measures, respectively.
Table 2 Adjusted P-values for comparing MAX with the Yu, Schlicker and Wang measures using TukeyHSD.

Measures              Adjusted p-value
                      yeast      Arabidopsis    human
MAX vs. Schlicker     2.8E-1     <1.0E-7        <1.0E-7
MAX vs. Wang          1.0E-2     <1.0E-7        <1.0E-7
MAX vs. Yu            1.1E-4     <1.0E-7        <1.0E-7
Wang vs. Schlicker    5.9E-1     9.6E-1         3.2E-1
Yu vs. Schlicker      6.0E-2     <1.0E-7        1.9E-1
Yu vs. Wang           5.8E-1     <1.0E-7        1.2E-3

Significant p-values are set in bold in the original table.



Figure 7 Comparing InteGO with the Yu, Schlicker and Wang measures with protein sequence similarity on yeast (a), Arabidopsis (b) and
human (c), where the x-axis is BLAST similarity (ln(BitScore)) and the y-axis is the normalized semantic similarity based on GO.
similarities and protein sequence similarities show that
InteGO is clearly the best of all the tested measures.

Maintaining a large background gene set in InteGO is
expensive. Therefore, extending InteGO from MF to BP or
even to other biological or medical ontologies is not a
trivial problem. In the future, we will continue to
improve InteGO to make it more efficient and applicable
to more ontologies. Because InteGO is an open framework,
its performance may be further improved by synergistically
integrating more seed measurements. We will continue to
integrate and compare InteGO with more recent gene-to-gene
measurements. We will also explore better integration
methods, such as using the EM algorithm to optimize the
weight of each seed measure, to achieve better performance.

Additional material 



Additional file 1: Average LogFC scores for different sizes of
background set. To test whether the selection of the BG affects the
integration performance, we compared the results for different
background sets on yeast. First, given the full BG, a subset of gene
pairs was randomly selected, with the percentage varying from 10% to
100%. This process was repeated 100 times. Second, as shown in
Additional file 1, the LogFC scores for each subset size were calculated
based on the randomly selected gene pairs. Since we do not use the full
set, the computable ECs are also a subset of all of the computable ECs.
In Additional file 1, the LogFC score increases linearly from 0 to 10 when
the coverage increases from 10% to 90%, then suddenly jumps to a high
score (13.8) when all of the background genes are used. This indicates
that, first, the size of the background set significantly affects the
integration measure and, second, using the full background set is best,
although it slightly increases the computational time.

Additional file 2: Distribution of the gene-to-gene similarities with 
Yu measure. Distribution of the gene-to-gene similarities with Yu 
measure for all of the genes in the Background Gene Set (BG) on yeast. 

Additional file 3: Distribution of the gene-to-gene similarities with 
Schlicker measure. Distribution of the gene-to-gene similarities with 
Schlicker measure for all of the genes in the Background Gene Set (BG) 
on yeast. 

Additional file 4: Distribution of the gene-to-gene similarities with 
Wang measure. Distribution of the gene-to-gene similarities with Wang 
measure for all of the genes in the Background Gene Set (BG) on yeast. 

Additional file 5: The 95% family-wise confidence level of 
TukeyHSD test on yeast. The 95% family-wise confidence level of 
TukeyHSD test on yeast, which compared MAX with all the three seed 
measures (Schlicker, Wang and Yu measure). 

Additional file 6: The 95% family-wise confidence level of 
TukeyHSD test on Arabidopsis. The 95% family-wise confidence level 
of TukeyHSD test on Arabidopsis, which compared MAX with all the 
three seed measures (Schlicker, Wang and Yu measure). 

Additional file 7: The 95% family-wise confidence level of 
TukeyHSD test on human. The 95% family-wise confidence level of 
TukeyHSD test on human, which compared MAX with all the three seed
measures (Schlicker, Wang and Yu measure). 

Additional file 8: The covariance scores comparing with sequence
similarity. The covariance scores comparing with sequence similarity on
yeast, Arabidopsis and human for the MAX, Yu, Schlicker and Wang measures.